NLP (Natural Language Processing)¶

Introduction¶

  • NLP - Text analysis, Sentiment analysis
  • Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

NLP can be broadly categorized into two types:¶

Natural Language Understanding (NLU):¶

it helps the machine to understand and analyse human language by extracting the metadata from content such as concepts, entities, keywords, emotion, relations, and semantic roles. NLU mainly used in Business applications to understand the customer's problem in both spoken and written language.

Natural Language Generation (NLG):¶

Natural Language Generation (NLG) acts as a translator that converts the computerized data into natural language representation. It mainly involves Text planning, Sentence planning, and Text Realization

image.png

How to compresse the text data¶

  • Paragrph into sentance
  • sentance into word
  • word anlysis (stop words - meaningless word ) - Remove stop words

NLP Preprocessing steps:¶

image.png

  • Text cleaning
    • Lowercase
    • Remove punctuation
  • Tokenization
    • sentance tokenization
    • word tokenization
  • stop word removing
  • stemming & lemmetization

Cleaning¶

In [1]:
### Using sumbols 
import string
puntuations = string.punctuation
In [2]:
puntuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [3]:
my_str = '(Hi, Hello!!!, How are you)'

# remove putuations 
no_punt = '' 
for i in my_str: 
    if i not in puntuations:
        no_punt = no_punt + i
In [4]:
no_punt
Out[4]:
'Hi Hello How are you'
In [5]:
## Using regex 
import re  
In [6]:
re.sub(r'[^\w\s]', '', my_str)
Out[6]:
'Hi Hello How are you'
In [7]:
## Tokenization 
import nltk # Natural Language Tool kit
nltk.download('punct') 
nltk.download('wordnet')
[nltk_data] Error loading punct: Package 'punct' not found in index
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sathy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[7]:
True
In [8]:
text = '''Backgammon is one of the oldest known board games. 
Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. 
It is a two player game where each player has fifteen checkers which move 
between twenty-four points according to the roll of two dice.'''
In [9]:
## Sentance tokenization 
sentance = nltk.sent_tokenize(text)
sentance
Out[9]:
['Backgammon is one of the oldest known board games.',
 'Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.',
 'It is a two player game where each player has fifteen checkers which move \nbetween twenty-four points according to the roll of two dice.']
In [10]:
## word tokenization 
for w in sentance: 
    word = nltk.word_tokenize(w) 
    print(word)
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']
['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']

Stop words¶

In [11]:
from nltk.corpus import stopwords

nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sathy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[11]:
True
In [12]:
stop_words = stopwords.words('english') 
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [13]:
### remove stop words  
sen = 'Backgammon is one of the oldest known board games.' 

word = nltk.word_tokenize(sen) 
word

new_sen = [w for w in word if not w in stop_words] 
new_sen
Out[13]:
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']

Stemming & Lemmetization¶

  • stemming - stem - chops the end of the word
  • Ex : songs - song singing - sing

image.png

In [14]:
# stemming 
from nltk.stem import PorterStemmer 
stemmer = PorterStemmer() 

input_text = 'had been changing cities mice'
input_word = nltk.word_tokenize(input_text) 
input_word
Out[14]:
['had', 'been', 'changing', 'cities', 'mice']
In [15]:
for i in input_word: 
    print(i, ':', stemmer.stem(i))
had : had
been : been
changing : chang
cities : citi
mice : mice
In [16]:
### Lemmatization 
from nltk.stem import WordNetLemmatizer 
lem = WordNetLemmatizer()  

input_text = 'had been changing cities mice'
input_word = nltk.word_tokenize(input_text) 
input_word
Out[16]:
['had', 'been', 'changing', 'cities', 'mice']
In [17]:
for i in input_word: 
    print(i, ':', lem.lemmatize(i))
had : had
been : been
changing : changing
cities : city
mice : mouse

POS Tagging¶

image.png

In [18]:
from nltk import word_tokenize 

word = word_tokenize('The sky is blue') 
word
Out[18]:
['The', 'sky', 'is', 'blue']
In [19]:
nltk.pos_tag(word)
Out[19]:
[('The', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('blue', 'JJ')]

NLP Techniques¶

  • spelling correction
  • Bag of Words
  • TF - IDF (Term Frequency - Inverse Document Frequency)

Spelling correction¶

In [22]:
# !pip install autocorrect
In [24]:
from autocorrect import Speller 
spell = Speller()
In [25]:
spell('amberlla')
Out[25]:
'umbrella'
In [26]:
spell('ur')
Out[26]:
'ur'

Bag of Words¶

  • Bag of words model helps convert the text into numerical representation (numerical feature vectors) such that the same can be used to train models using machine learning algorithms.
  • Here are the key steps of fitting a bag-of-words model:
      1. Create a vocabulary indices of words or tokens from the entire set of documents. The vocabulary indices can be created in alphabetical order.
      1. Construct the numerical feature vector for each document that represents how frequent each word appears in different documents. The feature vector representing each will be sparse in nature as the words in each document will represent only a small subset of words out of all words (bag-of-words) present in entire set of documents.
In [27]:
doc = ['Good movie', 'Worst movie', 'Good movie and Good screenplay']
doc
Out[27]:
['Good movie', 'Worst movie', 'Good movie and Good screenplay']
In [34]:
from sklearn.feature_extraction.text import CountVectorizer 

# design a vocabulary
count_vect = CountVectorizer() 
count_vect

# create bag of words model 
bag_of_words = count_vect.fit_transform(doc)

feature_name = count_vect.get_feature_names_out()
print(feature_name)

print(bag_of_words.toarray())
['and' 'good' 'movie' 'screenplay' 'worst']
[[0 1 1 0 0]
 [0 0 1 0 1]
 [1 2 1 1 0]]
In [35]:
import pandas as pd 
pd.DataFrame(bag_of_words.toarray(), columns=feature_name)
Out[35]:
and good movie screenplay worst
0 0 1 1 0 0
1 0 0 1 0 1
2 1 2 1 1 0

ngram¶

image.png

In [38]:
# design a vocabulary
count_vect = CountVectorizer(ngram_range=(1,2)) 
count_vect

# create bag of words model 
bag_of_words = count_vect.fit_transform(doc)

feature_name = count_vect.get_feature_names_out()
print(feature_name)

print(bag_of_words.toarray())
['and' 'and good' 'good' 'good movie' 'good screenplay' 'movie'
 'movie and' 'screenplay' 'worst' 'worst movie']
[[0 0 1 1 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 0 1 1]
 [1 1 2 1 1 1 1 1 0 0]]

TF-IDF¶

  • Term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

image.png

In [39]:
doc = ['Good movie', 'Worst movie', 'Good movie and Good screenplay'] 
In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vect = TfidfVectorizer() 

x = vect.fit_transform(doc)

name = vect.get_feature_names_out() 

pd.DataFrame(x.toarray(), columns=name)
Out[43]:
and good movie screenplay worst
0 0.000000 0.789807 0.613356 0.000000 0.000000
1 0.000000 0.000000 0.508542 0.000000 0.861037
2 0.463121 0.704430 0.273526 0.463121 0.000000

Spam mail ham mail classification¶

In [44]:
# import libraries 
import numpy as np   
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import wordcloud
In [46]:
import re 
import nltk 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 
stop_word = stopwords.words('english')
In [48]:
# read the dataset 
df = pd.read_csv('email spams.csv', encoding='latin')
df
Out[48]:
title Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
1334 ham Oh... Icic... K lor, den meet other day...
1335 ham Oh ! A half hour is much longer in Syria than ...
1336 ham Sometimes we put walls around our hearts,not j...
1337 ham Sweet, we may or may not go to 4U to meet carl...
1338 ham Then she buying today? Ü no need to c meh...

1339 rows × 2 columns

In [49]:
# apply label encoder 
from sklearn.preprocessing import LabelEncoder 
LE = LabelEncoder() 

df['title'] = LE.fit_transform(df['title'])
In [51]:
df.head()
Out[51]:
title Message
0 0 Go until jurong point, crazy.. Available only ...
1 0 Ok lar... Joking wif u oni...
2 1 Free entry in 2 a wkly comp to win FA Cup fina...
3 0 U dun say so early hor... U c already then say...
4 0 Nah I don't think he goes to usf, he lives aro...
In [52]:
# check info 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1339 entries, 0 to 1338
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    1339 non-null   int32 
 1   Message  1339 non-null   object
dtypes: int32(1), object(1)
memory usage: 15.8+ KB
In [53]:
# check duplictaes 
df.duplicated().sum()
Out[53]:
42
In [54]:
# drop duplicates 
df.drop_duplicates(inplace=True)
In [55]:
df.shape
Out[55]:
(1297, 2)

EDA¶

In [57]:
df.title.value_counts(normalize=True)
Out[57]:
title
0    0.855821
1    0.144179
Name: proportion, dtype: float64
In [62]:
plt.pie(df['title'].value_counts(), labels=['ham', 'spam'], autopct='%0.2f')
Out[62]:
([<matplotlib.patches.Wedge at 0x15934125d90>,
  <matplotlib.patches.Wedge at 0x15934135610>],
 [Text(-0.9890754300356166, 0.4813832087847064, 'ham'),
  Text(0.9890754300356165, -0.4813832087847066, 'spam')],
 [Text(-0.5394956891103363, 0.26257265933711255, '85.58'),
  Text(0.5394956891103362, -0.26257265933711266, '14.42')])
No description has been provided for this image
In [63]:
sns.countplot(x='title', data=df)
Out[63]:
<Axes: xlabel='title', ylabel='count'>
No description has been provided for this image

Preprocessing¶

In [66]:
# word tokenization 
def no_of_words(text):
    word = text.split()  
    word_count = len(word)
    return word_count
In [67]:
df['no of words'] = df['Message'].apply(no_of_words) 
df.head()
Out[67]:
title Message no of words
0 0 Go until jurong point, crazy.. Available only ... 20
1 0 Ok lar... Joking wif u oni... 6
2 1 Free entry in 2 a wkly comp to win FA Cup fina... 28
3 0 U dun say so early hor... U c already then say... 11
4 0 Nah I don't think he goes to usf, he lives aro... 13
In [69]:
fig, ax = plt.subplots(1,2, figsize=(10,5)) 
ax[0].hist(df[df['title']==1]['Message'].str.len(), label='spam', color = 'red')
ax[0].legend(loc='upper left')
ax[1].hist(df[df['title']==0]['Message'].str.len(), label='ham', color = 'green') 
ax[1].legend(loc='upper right')
Out[69]:
<matplotlib.legend.Legend at 0x15938f27690>
No description has been provided for this image
In [74]:
# create text processimg function 
def data_processing(text): 
    text = text.lower()  
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'<br />', '', text) 

    # tokenization  
    text_token = word_tokenize(text) 

    # stop words removal 
    filtered_text = [w for w in text_token if not w in stop_word]
    
    return " ".join(filtered_text)
In [75]:
df['new message'] = df['Message'].apply(data_processing) 
df.head()
Out[75]:
title Message no of words new message
0 0 Go until jurong point, crazy.. Available only ... 20 go jurong point crazy available bugis n great ...
1 0 Ok lar... Joking wif u oni... 6 ok lar joking wif u oni
2 1 Free entry in 2 a wkly comp to win FA Cup fina... 28 free entry 2 wkly comp win fa cup final tkts 2...
3 0 U dun say so early hor... U c already then say... 11 u dun say early hor u c already say
4 0 Nah I don't think he goes to usf, he lives aro... 13 nah dont think goes usf lives around though
In [76]:
# create function for lemmatization 
lem = WordNetLemmatizer() 

def lemmetizing(text): 
    data = [lem.lemmatize(word) for word in text] 
    return text
In [77]:
df['lematizing_data'] = df['new message'].apply(lemmetizing) 
df.head()
Out[77]:
title Message no of words new message lematizing_data
0 0 Go until jurong point, crazy.. Available only ... 20 go jurong point crazy available bugis n great ... go jurong point crazy available bugis n great ...
1 0 Ok lar... Joking wif u oni... 6 ok lar joking wif u oni ok lar joking wif u oni
2 1 Free entry in 2 a wkly comp to win FA Cup fina... 28 free entry 2 wkly comp win fa cup final tkts 2... free entry 2 wkly comp win fa cup final tkts 2...
3 0 U dun say so early hor... U c already then say... 11 u dun say early hor u c already say u dun say early hor u c already say
4 0 Nah I don't think he goes to usf, he lives aro... 13 nah dont think goes usf lives around though nah dont think goes usf lives around though
In [78]:
df['word_cunt(lem)'] = df['lematizing_data'].apply(no_of_words) 
df.head()
Out[78]:
title Message no of words new message lematizing_data word_cunt(lem)
0 0 Go until jurong point, crazy.. Available only ... 20 go jurong point crazy available bugis n great ... go jurong point crazy available bugis n great ... 16
1 0 Ok lar... Joking wif u oni... 6 ok lar joking wif u oni ok lar joking wif u oni 6
2 1 Free entry in 2 a wkly comp to win FA Cup fina... 28 free entry 2 wkly comp win fa cup final tkts 2... free entry 2 wkly comp win fa cup final tkts 2... 23
3 0 U dun say so early hor... U c already then say... 11 u dun say early hor u c already say u dun say early hor u c already say 9
4 0 Nah I don't think he goes to usf, he lives aro... 13 nah dont think goes usf lives around though nah dont think goes usf lives around though 8
In [80]:
# ham mail 
ham_mail = df[df['title'] == 0] 
ham_mail.head()
Out[80]:
title Message no of words new message lematizing_data word_cunt(lem)
0 0 Go until jurong point, crazy.. Available only ... 20 go jurong point crazy available bugis n great ... go jurong point crazy available bugis n great ... 16
1 0 Ok lar... Joking wif u oni... 6 ok lar joking wif u oni ok lar joking wif u oni 6
3 0 U dun say so early hor... U c already then say... 11 u dun say early hor u c already say u dun say early hor u c already say 9
4 0 Nah I don't think he goes to usf, he lives aro... 13 nah dont think goes usf lives around though nah dont think goes usf lives around though 8
6 0 Even my brother is not like to speak with me. ... 16 even brother like speak treat like aids patent even brother like speak treat like aids patent 8
In [82]:
text = ' '.join([word for word in ham_mail['lematizing_data']]) 
from wordcloud import WordCloud
plt.figure(figsize=(20,15)) 
wordcloud = WordCloud(max_words=200).generate(text)
plt.imshow(wordcloud)
Out[82]:
<matplotlib.image.AxesImage at 0x1593613a810>
No description has been provided for this image
In [83]:
from collections import Counter 

count = Counter() 
for text in ham_mail['lematizing_data'].values: 
    for word in text.split(): 
        count[word] += 1 

count.most_common(15)
Out[83]:
[('u', 204),
 ('im', 102),
 ('get', 72),
 ('ok', 67),
 ('like', 63),
 ('dont', 61),
 ('2', 60),
 ('got', 59),
 ('time', 56),
 ('know', 55),
 ('ill', 53),
 ('call', 53),
 ('go', 52),
 ('ltgt', 50),
 ('come', 50)]
In [84]:
ham_word = pd.DataFrame(count.most_common(15)) 
ham_word.columns = ['word', 'Count'] 
ham_word
Out[84]:
word Count
0 u 204
1 im 102
2 get 72
3 ok 67
4 like 63
5 dont 61
6 2 60
7 got 59
8 time 56
9 know 55
10 ill 53
11 call 53
12 go 52
13 ltgt 50
14 come 50
In [85]:
import plotly.express as px 

px.bar(ham_word, x='Count', y='word', title='Common words in ham_mail', color ='word')
In [86]:
# task = do for spam mail
In [87]:
x = df['lematizing_data'] 
y = df['title']
In [88]:
vect = TfidfVectorizer() 

x = vect.fit_transform(df['lematizing_data'])
In [89]:
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
In [90]:
x.shape
Out[90]:
(1297, 3968)
In [91]:
from sklearn.linear_model import LogisticRegression
In [92]:
lr = LogisticRegression() 

lr.fit(x_train, y_train)
Out[92]:
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression()
In [93]:
lr_pred = lr.predict(x_test) 
lr_pred
Out[93]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
In [94]:
from sklearn.metrics import accuracy_score
In [95]:
accuracy_score(y_test, lr_pred)
Out[95]:
0.9115384615384615
In [96]:
# apply all other alogithms 
In [ ]: